A marketing partner reached out to us to see if past player behavior can be predictive of future purchases. After connecting with analytics peers, I received the following data about players:
• player_sample contains a set of player IDs and their purchase date of FIFA 18 (or NULL if the player didn’t purchase the game)
• player_purchases contains the purchase date of all the games owned by each player, as well as the game genre and the launch date of the game.
• player_logins contains the dates when players logged in to (some of) the games they own. For each login date, the number of logins on that day is reported.
• player_spend contains the dates when a player purchased extra content for the games they own.
Using the available data, I will present the most interesting insights, build a machine learning model to predict the probability that a player will purchase FIFA 18, and evaluate whether the model’s performance is acceptable. Finally, I will suggest a couple of next steps to improve the performance of the model.
To build a predictive model, we first need to understand a player’s purchase behavior, as it is the best predictor of repeat purchasing and loyalty. It is measured in different ways depending on the industry and customer lifecycle; for this challenge we will focus on how often users purchase a game, log in to a game, and shop for additional game content. Purchase behavior captures how much players spend (on games and extra content), how much they play (logins), and in what combination or sequence. It codifies both the tenure and the recency of the relationship with the customer.
Recency, Frequency, & Monetary Value (RFM) is a quick, descriptive way to segment a marketing database on purchasing behavior.
Recency (R): the time since the last purchase (here, purchases of previous FIFA versions) or other meaningful transaction. The more recent the last action, the higher the likelihood that a player will respond to the next campaign or promotion. Here it is the number of days the player took to purchase after the launch date of the previous FIFA version.
Frequency (F): how often a user logs in to a game. It is typically a “Life to Date” field, i.e., the accumulation of all logins from the original date to the current date. Here it is the aggregate sum of logins.
Monetary Value (M): the total revenue earned from a user. Judgment is used to decide between “Life to Date” dollars, “Average” dollars, or some dollar amount over time. Here it is the monetary value brought by the player either by purchasing a full game (previous FIFA versions) or by additional content (or both).
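The three quantities can be sketched on toy data. The tables and column names below are illustrative stand-ins, not the actual data described above:

```r
# A minimal sketch (toy data) of how R, F, and M are derived per player.
purchases <- data.frame(
  id     = c("a", "a", "b"),
  launch = as.Date(c("2015-09-22", "2016-09-27", "2015-09-22")),
  bought = as.Date(c("2015-10-01", "2016-10-05", "2016-01-15")),
  price  = c(100, 110, 100)
)
logins <- data.frame(id = c("a", "a", "b"), login_count = c(3, 5, 2))

purchases$days_to_buy <- as.numeric(purchases$bought - purchases$launch)
recency   <- aggregate(days_to_buy ~ id, purchases, mean)  # mean days from launch to purchase
frequency <- aggregate(login_count ~ id, logins, sum)      # total logins per player
monetary  <- aggregate(price ~ id, purchases, sum)         # total revenue per player
```

Player “a” here ends up with Recency 8.5 days, Frequency 8, and Monetary 210, which is the same per-player shape the real features take later.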
Step 1: Descriptive Analytics: performing initial investigations on the data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Step 2: Data Wrangling: cleaning, structuring, and enriching the raw data into the desired format for better decision making. Our model will perform better when fed Recency, Frequency, & Monetary Value (RFM), so in this step we convert the data into RFM features and feed them to the model.
Step 3: Predictive Analytics: using statistical models to infer future purchase behavior from past trends. We will build a logistic regression model and evaluate its performance.
Step 4: Player Segmentation: using a mixture of marketing and machine learning techniques to advise on possible outcomes. We will segment users and look at the groups to find the best-fit target audience for FIFA 19.
Step 5: Present the key findings and suggest next steps to improve the performance of the model.
packages <- c("knitr",            # To knit the code into HTML output
              "sqldf",            # To run SQL queries on R data frames
              "dplyr",            # Data wrangling
              "magrittr",         # Pipe operators
              "Amelia",           # Missing-data visualisation
              "tidyverse",        # Data manipulation & visualisation
              "plotly",           # Interactive plots for data visualisation
              "InformationValue", # Optimal cut-off & classification metrics
              "lubridate",        # Date-time manipulation
              "car",              # Variance inflation factors (multicollinearity)
              "boot",             # Cross-validation
              "cluster",          # Clustering algorithms
              "factoextra",       # Clustering algorithms & visualisation
              "widgetframe")      # Embedding htmlwidgets
# Install any packages that are missing
for (pkg in packages) {
  if (!(pkg %in% installed.packages()[, "Package"])) {
    install.packages(pkg)
  }
}
# Load the required packages into the session
for (pkg in packages) {
  library(pkg, character.only = TRUE, quietly = TRUE)
}
Required packages are installed.
#Loading Data
player_logins <- read.csv('player_logins.csv')
player_purchases <- read.csv('player_purchases.csv')
player_spend <- read.csv('player_spend.csv')
player_sample <- read.csv('player_sample.csv')
Loading player_logins, player_purchases, player_spend & player_sample into R data frames.
Let’s look at the distribution of purchases across games.
player_purchases_eda <- player_purchases
player_purchases_eda$count <- c(1)
plot_ly(player_purchases_eda, x =~game_name, y=~count,type="bar",mode="markers", colors = "Set1") %>%
layout(title = "Game purchases",
yaxis = list(title = "Purchases",showgrid = T),
xaxis = list(title = "",showgrid = F))
FIFA is the most purchased game; let us look at the percentage of FIFA purchases.
print(paste('Percentage of FIFA purchases : ',
            sum(player_purchases_eda$count[player_purchases_eda$game_name %in% c('FIFA 14', 'FIFA 15', 'FIFA 16', 'FIFA 17')]) * 100 / sum(player_purchases_eda$count)))
## [1] "Percentage of FIFA purchases : 39.6774193548387"
Close to 40% of all purchases are FIFA titles.
Let’s now look at logins.
player_logins_eda <- aggregate(login_count ~ game_name, player_logins, sum)
plot_ly(player_logins_eda, x =~game_name, y=~login_count,type="bar",mode="markers", colors = "Set1") %>%
layout(title = "Number of logins per game",
yaxis = list(title = "logins",showgrid = T),
xaxis = list(title = "", showgrid = F))
Once again FIFA dominates all other games.
print(paste('Percentage of users logging-in to play FIFA : ',
            sum(player_logins_eda$login_count[player_logins_eda$game_name %in% c('FIFA 14', 'FIFA 15', 'FIFA 16', 'FIFA 17')]) * 100 / sum(player_logins_eda$login_count)))
## [1] "Percentage of users logging-in to play FIFA : 72.1027020474604"
There are two important observations here:
1) Over 72% of game logins are for FIFA titles.
2) Some users have already logged in to FIFA 18; let’s count them in the following section.
print(paste('Unique users logged in to FIFA 18:', length(unique(player_logins$id[player_logins$game_name=='FIFA 18']))))
## [1] "Unique users logged in to FIFA 18: 43"
This is surprising and raises a new question: did any players play the game even before purchasing it?
Let’s find out!
# Subsetting dataframes for our business problem.
fifa_titles <- c('FIFA 14', 'FIFA 15', 'FIFA 16', 'FIFA 17')
player_logins    <- subset(player_logins,    game_name %in% fifa_titles)
player_purchases <- subset(player_purchases, game_name %in% fifa_titles)
player_spend     <- subset(player_spend,     game_name %in% fifa_titles)
Focusing on FIFA users only, we slice the data to include only FIFA-related observations.
#trial -ve
users_NOT_purchased_but_loggedIn<- sqldf('SELECT * FROM player_logins t1
LEFT JOIN player_purchases t2
ON t1.id = t2.id AND t1.game_name = t2.game_name
WHERE purchase_date is null')
#trial +ve
users_logged_before_purchase_and_purchased <- sqldf('SELECT * FROM player_logins t1
LEFT JOIN player_purchases t2
ON t1.id = t2.id AND t1.game_name = t2.game_name
WHERE purchase_date > login_day')
print(paste('Users who logged-in to the game before purchase and purchased: ', length(unique(users_logged_before_purchase_and_purchased$id))))
## [1] "Users who logged-in to the game before purchase and purchased: 975"
print(paste('Users who logged-in to the game before purchase and NOT purchased: ', length(unique(users_NOT_purchased_but_loggedIn$id))))
## [1] "Users who logged-in to the game before purchase and NOT purchased: 966"
#trial_users_not_purchased[is.na(trial_users_not_purchased)] <- 0
trial_users_not_purchased <- subset(users_NOT_purchased_but_loggedIn, select = c('id'))
trial_users_purchased <- subset(users_logged_before_purchase_and_purchased, select = c('id'))
trial_users<-unique(Reduce(function(x, y) merge(x, y, all=TRUE), list(trial_users_not_purchased, trial_users_purchased )))
We have 1471 unique users who logged in to the game before purchasing it (or without ever purchasing).
I assume these are trial/beta users.
Now let’s look at our sample data to check the behavior of these users.
#Understanding Sample Data
player_sample$purchase_date <- sub("^$", 0, player_sample$purchase_date)
player_sample$purchase_date[player_sample$purchase_date != 0] <- 1
player_sample$purchase_date <- as.numeric(player_sample$purchase_date)
colnames(player_sample) <- c('Purchase', 'id')
The player_sample data set has 2500 purchased and 2500 non-purchased players, a perfectly balanced dataset.
player_sample_trial = data.frame(subset(player_sample, (id %in% trial_users$id)))
player_sample_without_trial = data.frame(subset(player_sample, !(id %in% trial_users$id)))
print(paste('Users who logged-in to the game before launch date (trial Users): ', length((player_sample_trial$id))))
## [1] "Users who logged-in to the game before launch date (trial Users): 1471"
print(paste('Users who logged-in to the game after purchase (Normal Users): ', length((player_sample_without_trial$id))))
## [1] "Users who logged-in to the game after purchase (Normal Users): 3529"
There are 1471 unique users in player_sample who logged in to the game before purchase; let’s call them trial users.
We separate trial users from normal users in player_sample to check their purchase behavior.
#understanding trial-users
print(paste('Proportion of trial users who purchased: ',sum(player_sample_trial$Purchase)/length(player_sample_trial$id)))
## [1] "Proportion of trial users who purchased: 0.77906186267845"
Conclusion: about 78% of trial users purchased FIFA 18. This is an excellent insight for deciding whether to market trial versions.
#How many unique players purchased previous FIFA versions?
print(paste('Unique users in player_purchases: ',length(unique(player_purchases[["id"]]))))
## [1] "Unique users in player_purchases: 3338"
#Structure of the data
str(player_purchases)
## 'data.frame': 4920 obs. of 5 variables:
## $ game_name : Factor w/ 33 levels "BATTLEFIELD 1",..: 9 10 10 9 9 10 9 10 9 9 ...
## $ id : Factor w/ 4838 levels "0005f889da556e88fb52f4a38d9fd78892121254",..: 4223 4563 4306 1597 2142 3583 187 4016 3683 285 ...
## $ game_genre : Factor w/ 6 levels "Action","Fighting",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ game_launch_date: Factor w/ 18 levels "","2013-10-29",..: 5 13 13 5 5 13 5 13 5 5 ...
## $ purchase_date : Factor w/ 729 levels "2015-09-02","2015-09-03",..: 249 401 642 147 28 394 60 663 142 21 ...
#Converting 'game_launch_date' & 'purchase_date' from factor to Date format
player_purchases$game_launch_date <- ymd(player_purchases$game_launch_date)
player_purchases$purchase_date <- ymd(player_purchases$purchase_date)
str(player_purchases)
## 'data.frame': 4920 obs. of 5 variables:
## $ game_name : Factor w/ 33 levels "BATTLEFIELD 1",..: 9 10 10 9 9 10 9 10 9 9 ...
## $ id : Factor w/ 4838 levels "0005f889da556e88fb52f4a38d9fd78892121254",..: 4223 4563 4306 1597 2142 3583 187 4016 3683 285 ...
## $ game_genre : Factor w/ 6 levels "Action","Fighting",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ game_launch_date: Date, format: "2015-09-22" "2016-09-27" ...
## $ purchase_date : Date, format: "2016-05-07" "2016-10-07" ...
We can see that there are 3338 unique users who purchased previous versions of FIFA.
Looking at the structure of the data frame, purchase_date was read in as a factor; we converted it to Date format, which will be used to build Recency.
#Dealing with the missing data
player_purchases$game_launch_date[player_purchases$game_name == 'FIFA 14'] <- as.Date('2013-09-23')
A brief inspection shows that game_launch_date is missing for FIFA 14, so we impute it after a quick search for the actual launch date.
#Calculating recency with 'game_launch_date' & 'purchase_date' for the previous versions
player_purchases$Recency <- as.numeric(
difftime(player_purchases$purchase_date, player_purchases$game_launch_date+4, units = "days")
)
player_purchases_agg1 <- aggregate(Recency ~ id, player_purchases, mean) # Used as Recency
We now calculate Recency from player_purchases.
Recency (R) is the time since the last purchase; here, purchases of previous FIFA versions. The value is the difference in days between launch date and purchase date.
Calculating Recency:
• Subtract the game launch date from the purchase date.
• Aggregate the average, grouping by player id.
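The two steps above can be worked through on a single record. The dates here are illustrative, not taken from the data:

```r
# Days from launch to purchase for one (hypothetical) FIFA 17 record
launch  <- as.Date("2016-09-27")
bought  <- as.Date("2016-10-07")
recency <- as.numeric(difftime(bought, launch, units = "days"))
recency  # 10 days between launch and purchase
```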
player_purchases <- player_purchases %>%
mutate(Money_spent = case_when(
game_name == 'FIFA 14' ~ 100,
game_name == 'FIFA 15' ~ 110,
game_name == 'FIFA 16' ~ 121,
game_name == 'FIFA 17' ~ 133.1
)
)
player_purchases_agg2 <- aggregate(Money_spent ~ id, player_purchases, sum) # Building Monetary Value
player_purchases_agg <- sqldf('SELECT * FROM player_purchases_agg1
LEFT JOIN player_purchases_agg2
USING(id)')
We now calculate the Monetary value.
Monetary Value (M) is the revenue earned from a user: the value brought by the player either by buying a full game (previous FIFA versions) or additional content.
Calculating Monetary Value: Part 1
• Assume a value, say 100, as the revenue from a player buying FIFA 14, increasing 10% for each successive version.
• Aggregate the sum, grouping by player id.
• Use SQL to join the data frames.
#Looking at logins Data Frame
player_logins_agg <- aggregate(login_count ~ id, player_logins, sum) # Used as Frequency
colnames(player_logins_agg) <- c('id', 'Frequency')
# Identify outliers
#outliers <- boxplot(player_logins_agg$Frequency, plot = FALSE)$out
outliers <- max(player_logins_agg$Frequency)   # removing only the single most extreme value
# Remove outliers
player_logins_agg <-player_logins_agg[!(player_logins_agg$Frequency %in% outliers), ]
Frequency (F) is how often a user logs in to a game; here it is the aggregate sum of login counts per player.
Calculating Frequency: aggregate the sum of login_count, grouping by player id.
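On a toy login table (illustrative rows, same columns as player_logins), the aggregation looks like this:

```r
# Frequency is the per-player sum of daily login counts
logins <- data.frame(
  id          = c("a", "a", "b"),
  login_day   = as.Date(c("2016-10-01", "2016-10-02", "2016-10-01")),
  login_count = c(4, 2, 7)
)
frequency <- aggregate(login_count ~ id, logins, sum)
```

Player “a” collapses from two login days (4 + 2) into a single Frequency of 6.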
#Manipulating Spend Data Frame
# Weighting the extra-content spend by game version
player_spend <- player_spend %>%
mutate(spent = case_when(
game_name == 'FIFA 14' ~ 5,
game_name == 'FIFA 15' ~ 5.5,
game_name == 'FIFA 16' ~ 6.05,
game_name == 'FIFA 17' ~ 6.655
)
)
player_spend_agg <- aggregate(spent ~ id, player_spend, sum)
Calculating Monetary Value: Part 2
We assumed a value of 100 as the revenue from a player buying FIFA 14. Now we make another assumption: the average spend on extra content is 5% of the product value.
• Aggregate the sum, grouping by player id.
purchase_spend_combined <- sqldf('SELECT * FROM player_purchases_agg2
LEFT JOIN player_spend_agg
USING(id)')
purchase_spend_combined$spent[is.na(purchase_spend_combined$spent)] <- 0
purchase_spend_combined$Monetary <- purchase_spend_combined$spent + purchase_spend_combined$Money_spent
purchase_spend_combined <- purchase_spend_combined[c(1,4)] # Used as Monetary
colnames(purchase_spend_combined) <- c('id', 'Monetary')
Calculating Monetary Value: Part 3
We join the results from Part 1 and Part 2 and add them to get the Monetary Value.
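The same join-fill-add pattern can be shown in base R on toy tables (illustrative values; the report uses sqldf for the join, but `merge` is equivalent here):

```r
# Left-join game revenue to extra-content spend, zero-fill missing spend, add
game_rev  <- data.frame(id = c("a", "b"), Money_spent = c(210, 100))
extra_rev <- data.frame(id = "a", spent = 10.5)
combined  <- merge(game_rev, extra_rev, by = "id", all.x = TRUE)  # base-R left join
combined$spent[is.na(combined$spent)] <- 0                        # no extra content bought
combined$Monetary <- combined$Money_spent + combined$spent
```

Player “b” never bought extra content, so the left join leaves an NA that the zero-fill turns into 0 before the addition.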
#Merging Data Frames
player_sample_logins <- sqldf('SELECT * FROM player_sample
LEFT JOIN player_logins_agg
USING(id)')
player_sample_logins_spend <- sqldf('SELECT * FROM player_sample_logins
LEFT JOIN purchase_spend_combined
USING(id)')
player_combined <- sqldf('SELECT * FROM player_sample_logins_spend
LEFT JOIN player_purchases_agg1
USING(id)')
head(player_combined)
## Purchase id Frequency Monetary Recency
## 1 1 86b4ae7a794d4c196435cd0a63e83303cd3e6f34 297 267.41 0.5
## 2 1 3fd1574f04d7a380eaf913c9d86cb755d005d84c 360 133.10 2.0
## 3 1 c11e2a8ff857babd62628fcecc66340e9d828967 246 267.41 50.0
## 4 0 5f639924f04f27d3a49f290fcc5608fda1c58750 1 121.00 253.0
## 5 0 d1f943556959c23e9ad0eac81d6135c157de59ea 3 121.00 158.0
## 6 1 49023b8b764b9ecd0384c9d6bbc5c81e4bce7704 91 254.10 23.0
Now that we have seen all the data sets, it’s time to integrate them and proceed to build a model.
#Removing trial Users from the data
player_combined = data.frame(subset(player_combined, !(id %in% trial_users$id)))
We remove trial users from our data because including them may bias the prediction, since many normal players never got to use a trial version.
The combined dataset still has some missing values, which we investigate next.
library(Amelia)
missmap(player_combined, main = "Missing values vs observed")
The above plot helps us understand the percentage of missing values in our combined dataset.
# Removing observations if entire row is blank
player_combined <- player_combined[rowSums(is.na(player_combined)) != ncol(player_combined)-2,]
#Replacing NA's in Activity & Monetary with 0
player_combined$Frequency[is.na(player_combined$Frequency)] <- 0
player_combined$Monetary[is.na(player_combined$Monetary)] <- 0
#Filling NA's in Recency with the maximum value
player_combined$Recency[is.na(player_combined$Recency)] <- max(player_combined$Recency, na.rm=TRUE)
Missing value imputation:
• Remove the entire observation if Recency, Frequency & Monetary Value are all null.
• Fill NAs with 0 for Frequency and Monetary, and with the maximum observed value for Recency.
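The fill rules in miniature, on a toy two-row frame (illustrative values):

```r
# 0 for missing Frequency/Monetary; maximum observed value for missing Recency
df <- data.frame(Frequency = c(5, NA), Monetary = c(NA, 120), Recency = c(10, NA))
df$Frequency[is.na(df$Frequency)] <- 0
df$Monetary[is.na(df$Monetary)]   <- 0
df$Recency[is.na(df$Recency)]     <- max(df$Recency, na.rm = TRUE)
```

A missing Recency becomes the worst (largest) observed value, reflecting a player with no recorded recent purchase.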
missmap(player_combined, main = "Missing values vs observed")
#Normalizing the data (min-max scaling)
player_combined_scaled <- transform(player_combined, Frequency = (Frequency - min(Frequency)) / (max(Frequency) - min(Frequency)))
player_combined_scaled <- transform(player_combined_scaled, Monetary = (Monetary - min(Monetary)) / (max(Monetary) - min(Monetary)))
player_combined_scaled <- transform(player_combined_scaled, Recency = (Recency - min(Recency)) / (max(Recency) - min(Recency)))
We scale the features so the model is not dominated by differences in magnitude.
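The transform applied three times above can be factored into one helper (`minmax` is a hypothetical name, not used elsewhere in the report):

```r
# Min-max scaling: map a numeric vector onto [0, 1]
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- minmax(c(10, 20, 30))
scaled  # 0.0 0.5 1.0
```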
#Arranging columns for interpretability
player_combined_scaled$Purchase <- as.numeric(player_combined_scaled$Purchase)
player_combined_scaled <- player_combined_scaled[c(2,5,4,3,1)]
head(player_combined_scaled)
## id Recency Monetary Frequency
## 2 3fd1574f04d7a380eaf913c9d86cb755d005d84c 0.004975124 0.05067167 0.1195869420
## 4 5f639924f04f27d3a49f290fcc5608fda1c58750 0.183368870 0.03214819 0.0000000000
## 5 d1f943556959c23e9ad0eac81d6135c157de59ea 0.115849325 0.03214819 0.0006662225
## 7 87b520c11ffceb799d0ac49a6867fc6fe3d071b2 0.060412225 0.05067167 0.1882078614
## 8 c6a8d9b15ba63fdc31d4cfc44f4e348698f4d0fd 0.067164179 0.24609438 0.1918720853
## 9 4a273dbc4840a8a20d8288cdeee23d043bfda43f 0.177683014 0.03214819 0.0063291139
## Purchase
## 2 1
## 4 0
## 5 0
## 7 1
## 8 1
## 9 0
#Train Test Split
# Set Seed so that same sample can be reproduced in future
set.seed(101)
# Now Selecting 75% of data as sample from total 'n' rows of the data
sample <- sample.int(n = nrow(player_combined_scaled), size = floor(.75*nrow(player_combined_scaled)), replace = F)
train <- player_combined_scaled[sample, ]
test <- player_combined_scaled[-sample, ]
print(paste('Training observations : ', length(train$id)))
## [1] "Training observations : 1551"
print(paste('Testing observations : ', length(test$id)))
## [1] "Testing observations : 518"
#Adjusting Test Data set
test_onlyPurchase <- subset(test, select = c("id","Purchase"))
test_withoutPurchase <- subset(test, select = c("id","Recency",'Frequency', "Monetary"))
We split our data into training and test sets to build and evaluate the model.
#Building a machine learning model to predict the probability of a player purchasing FIFA 18
mylogit <- glm(Purchase ~ Recency + Monetary+ Frequency, data = train, family = 'binomial')
We now build a predictive model, training on known outcomes so that it can predict probabilities for new data.
The model outputs a probability of the target variable (purchase) based on the estimated significance of the input variables.
summary(mylogit)
##
## Call:
## glm(formula = Purchase ~ Recency + Monetary + Frequency, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.2781 -0.9829 0.3279 1.0034 1.4988
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8961 0.1121 -7.997 1.28e-15 ***
## Recency 1.8091 0.4735 3.821 0.000133 ***
## Monetary 4.9795 0.7836 6.355 2.09e-10 ***
## Frequency 18.5706 1.8607 9.980 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2111.9 on 1550 degrees of freedom
## Residual deviance: 1782.5 on 1547 degrees of freedom
## AIC: 1790.5
##
## Number of Fisher Scoring iterations: 6
Model conclusions:
1) Recency, Frequency & Monetary Value are all statistically significant, i.e., all three matter for predicting a new player’s purchase probability.
2) Frequency is the most important factor: users who log in most often have the highest probability of purchase.
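The fit-then-predict pattern used here can be reproduced end-to-end on synthetic data (this is a sketch with made-up data and one predictor, not the real training set):

```r
# Logistic regression on synthetic data: probability of purchase rises with x
set.seed(1)
n <- 200
x <- runif(n)
y <- rbinom(n, 1, plogis(-2 + 4 * x))            # true purchase odds increase in x
fit <- glm(y ~ x, family = "binomial")
p   <- predict(fit, newdata = data.frame(x = c(0.1, 0.9)), type = "response")
```

`type = "response"` returns probabilities on the 0-1 scale rather than log-odds, which is what the report later thresholds into purchase/no-purchase decisions.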
#Variance Inflation Factor
vif(mylogit)
## Recency Monetary Frequency
## 1.037487 1.049440 1.080154
Multicollinearity means two or more variables feed the same information into the model, which biases the estimates. We do not have that problem here: the variance inflation factor for all three variables is well below 10.
test_withoutPurchase$predicted_prob <- predict(mylogit, newdata = test_withoutPurchase, type = "response")
head(test_withoutPurchase)
## id Recency Frequency Monetary
## 2 3fd1574f04d7a380eaf913c9d86cb755d005d84c 0.004975124 0.119586942 0.05067167
## 8 c6a8d9b15ba63fdc31d4cfc44f4e348698f4d0fd 0.067164179 0.191872085 0.24609438
## 39 d86038e49f4422ed11698314c076a5d1073a6a7e 0.060412225 0.002664890 0.05067167
## 54 4e2abaa01891b00c3ec42640ad5477d675aced5e 0.286069652 0.036309127 0.26647021
## 59 00ed3557c6d9d978e89548e639ac23c786499636 0.169154229 0.007328448 0.05067167
## 66 2a6b8c854f0d27a2aa5f97afe4d6f6a130537119 0.143923241 0.084277149 0.23590646
## predicted_prob
## 2 0.8300501
## 8 0.9822589
## 39 0.3810612
## 54 0.8351451
## 59 0.4497404
## 66 0.8912907
We extracted the predicted probability for each test observation as predicted_prob.
#Converting probabilities to classes with cutoff probability
optCutOff <- optimalCutoff(test_onlyPurchase$Purchase, test_withoutPurchase$predicted_prob)[1]
print(paste('Optimal cutoff point: ', optCutOff))
## [1] "Optimal cutoff point: 0.42994809103925"
test_withoutPurchase$predicted <- ifelse(test_withoutPurchase$predicted_prob >= optCutOff, 1, 0)
head(test_withoutPurchase)
## id Recency Frequency Monetary
## 2 3fd1574f04d7a380eaf913c9d86cb755d005d84c 0.004975124 0.119586942 0.05067167
## 8 c6a8d9b15ba63fdc31d4cfc44f4e348698f4d0fd 0.067164179 0.191872085 0.24609438
## 39 d86038e49f4422ed11698314c076a5d1073a6a7e 0.060412225 0.002664890 0.05067167
## 54 4e2abaa01891b00c3ec42640ad5477d675aced5e 0.286069652 0.036309127 0.26647021
## 59 00ed3557c6d9d978e89548e639ac23c786499636 0.169154229 0.007328448 0.05067167
## 66 2a6b8c854f0d27a2aa5f97afe4d6f6a130537119 0.143923241 0.084277149 0.23590646
## predicted_prob predicted
## 2 0.8300501 1
## 8 0.9822589 1
## 39 0.3810612 0
## 54 0.8351451 1
## 59 0.4497404 1
## 66 0.8912907 1
We used the optimalCutoff function to translate these probabilities into decisions (1 = purchase, 0 = no purchase).
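The thresholding step itself is a one-liner; here it is on four illustrative probabilities, with 0.43 standing in for the cut-off found by optimalCutoff:

```r
# Convert predicted probabilities into class labels at a fixed cut-off
probs  <- c(0.83, 0.98, 0.38, 0.45)
cutoff <- 0.43                       # illustrative stand-in for the optimal cut-off
pred   <- ifelse(probs >= cutoff, 1, 0)
pred  # 1 1 0 1
```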
set.seed(101)
cv_errors = data.frame(delta1 = 0, delta2 = 0)
for (i in 1:10) {
model_cv = glm(Purchase ~ Recency + Monetary+ Frequency, data = train, family = 'binomial')
cv_errors[i, ] = cv.glm(train, model_cv, K=10)$delta
}
print(paste('Training Accuracy for the model : ', round(1-mean(cv_errors$delta2),2)))
## [1] "Training Accuracy for the model : 0.81"
We performed K-fold cross-validation to evaluate the model. Note that cv.glm’s delta is the cross-validated prediction error (here the mean squared error of the predicted probabilities), so 1 − delta is only a rough proxy for accuracy; at roughly 80%, the model looks ready to predict.
# Creating confusion matrix
confusionMatrix(test_onlyPurchase$Purchase, test_withoutPurchase$predicted, threshold = optCutOff)
## 0 1
## 0 132 33
## 1 76 277
The confusion matrix above shows how the test examples were classified by our model. Our test set holds 518 observations, of which 409 were correctly classified.
# Computing the misclassification error
misClassificationError <- misClassError(test_onlyPurchase$Purchase, test_withoutPurchase$predicted, threshold = optCutOff)
print(paste('Accuracy on testing examples :', round(1-misClassificationError, 2)))
## [1] "Accuracy on testing examples : 0.79"
print(paste('Misclassification Rate :', round(misClassificationError, 2)))
## [1] "Misclassification Rate : 0.21"
Measured on the test examples, the model predicts with 79% accuracy.
#Plotting ROC curve
plotROC(test_onlyPurchase$Purchase, test_withoutPurchase$predicted_prob)
ROC is a probability curve and AUC measures separability: how well the model distinguishes between the two classes. The curve is traced by varying the threshold; lowering the threshold yields more positive predictions, and vice versa.
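AUC can also be computed by hand via the rank formula, which equals the probability that a randomly chosen positive scores above a randomly chosen negative. A sketch on toy labels and scores (not the model’s actual predictions):

```r
# AUC from ranks: (sum of positive ranks - n1*(n1+1)/2) / (n1 * n0)
y <- c(0, 0, 1, 1, 0, 1)                 # toy labels
p <- c(0.2, 0.7, 0.8, 0.6, 0.5, 0.9)     # toy scores
r  <- rank(p)
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc  # 8/9: one negative outranks one positive
```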
Player segmentation divides a broad consumer base into sub-groups of users with shared characteristics. This will help the business choose the target audience for the pre-launch campaign (FIFA 19).
#Dividing scaled variables into 4-quantiles
segmented <- player_combined_scaled
segmented$Frequency_Quantile <- ntile(segmented$Frequency, 4)
segmented$Monetary_Quantile <- ntile(segmented$Monetary, 4)
segmented$Recency_Quantile <- ntile(-1*segmented$Recency, 4)
#Concatenating Recency, Frequency & Monetary quartiles
segmented$Group <- paste(segmented$Recency_Quantile, segmented$Frequency_Quantile, segmented$Monetary_Quantile)
segmented_group <- subset(segmented, select = c('id', 'Group', 'Purchase'))
head(segmented_group)
## id Group Purchase
## 2 3fd1574f04d7a380eaf913c9d86cb755d005d84c 4 4 2 1
## 4 5f639924f04f27d3a49f290fcc5608fda1c58750 1 1 1 0
## 5 d1f943556959c23e9ad0eac81d6135c157de59ea 2 1 1 0
## 7 87b520c11ffceb799d0ac49a6867fc6fe3d071b2 3 4 2 1
## 8 c6a8d9b15ba63fdc31d4cfc44f4e348698f4d0fd 3 4 4 1
## 9 4a273dbc4840a8a20d8288cdeee23d043bfda43f 2 2 1 0
The simplest way to create player segments from the RFM model is to use quartiles: assign a score from 1 to 4 to each of Recency, Frequency, and Monetary, where 4 is the best/highest value and 1 the lowest/worst. The final RFM score is simply the concatenation of the individual scores.
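The quartile scoring done by dplyr’s ntile() above can be reproduced in base R (the base formula matches ntile when the group sizes divide evenly, as in this toy vector):

```r
# Assign each value to one of 4 quartile buckets (1 = lowest, 4 = highest)
x <- c(5, 1, 9, 3, 7, 2, 8, 4)
quartile <- ceiling(4 * rank(x, ties.method = "first") / length(x))
quartile  # 3 1 4 2 3 1 4 2
```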
#Cluster Analysis
segmented <- subset(segmented, select = c('id', 'Recency','Frequency', 'Monetary', 'Group', 'Purchase'))
wssplot <- function(data, nc=18, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")}
wssplot(segmented[,2:4])
I would like to combine machine learning with RFM segmentation. To choose the number of clusters for K-means, we inspect the elbow plot above.
After trial and error from 3 to 6 clusters, I chose to go with 3 clusters.
#K-means clustering
set.seed(101)
kmeansCluster <- kmeans(segmented[,2:4], centers = 3, nstart = 25)
str(kmeansCluster)
## List of 9
## $ cluster : Named int [1:2069] 1 1 1 1 2 1 1 1 2 1 ...
## ..- attr(*, "names")= chr [1:2069] "2" "4" "5" "7" ...
## $ centers : num [1:3, 1:3] 0.1054 0.1125 0.4499 0.0287 0.1038 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:3] "Recency" "Frequency" "Monetary"
## $ totss : num 68.7
## $ withinss : num [1:3] 10.41 17.24 4.27
## $ tot.withinss: num 31.9
## $ betweenss : num 36.8
## $ size : int [1:3] 1408 508 153
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
#fviz_cluster(player_cluster_result, data = player_cluster[,1:2])
fviz_cluster(kmeansCluster, data = segmented[,2:4], geom = "point",
stand = FALSE, ellipse.type = "norm") +
theme_bw() + scale_y_continuous(labels = scales::comma) +
ggtitle(label='Player Clusters')
This is a 2-D visualization of the clusters. We can see that the clusters are not completely separable, but I will make the best use of this segregation.
# Tagging each player with their cluster number
player_clusters <- data.frame(id = segmented$id, cluster = kmeansCluster$cluster)
player_combined_clusters <- sqldf('SELECT * FROM segmented as t1
LEFT JOIN player_clusters as t2
USING(id)')
player_combined_clusters <- subset(player_combined_clusters, select = c('id', 'Group', 'Purchase' ,'cluster'))
head(player_combined_clusters)
## id Group Purchase cluster
## 1 3fd1574f04d7a380eaf913c9d86cb755d005d84c 4 4 2 1 1
## 2 5f639924f04f27d3a49f290fcc5608fda1c58750 1 1 1 0 1
## 3 d1f943556959c23e9ad0eac81d6135c157de59ea 2 1 1 0 1
## 4 87b520c11ffceb799d0ac49a6867fc6fe3d071b2 3 4 2 1 1
## 5 c6a8d9b15ba63fdc31d4cfc44f4e348698f4d0fd 3 4 4 1 2
## 6 4a273dbc4840a8a20d8288cdeee23d043bfda43f 2 2 1 0 1
Tagged each user with their cluster number.
player_combined_clusters$count <- c(1)
player_combined_clusters$cluster <- as.factor(player_combined_clusters$cluster)
plot_ly(player_combined_clusters, x =~Group, y=~count,type="bar",mode="markers", color= ~ cluster) %>%
layout(title = "Cluster values on segmented groups",
yaxis = list(title = "Count",showgrid = T),
xaxis = list(title = "Group",showgrid = F))
Conclusion:
1) Cluster 1 holds players who bring medium monetary value to the company with a medium number of logins. -Valuable users
2) Cluster 2 combines frequent logins with high monetary value. -Active users
3) Cluster 3 holds users with low monetary value and few logins, and the highest recency score (purchased long ago). -Inactive users
for (i in 1:3) {
print(paste('Probability of purchase from cluster-: ',
i,
' : ' ,
round(sum(player_combined_clusters$Purchase[player_combined_clusters$cluster == i])/sum(player_combined_clusters$count[player_combined_clusters$cluster == i]),2),
' ::: User base',
round(sum(player_combined_clusters$count[player_combined_clusters$cluster == i])/sum(length(segmented$id)+length(trial_users$id)),2)))
}
## [1] "Probability of purchase from cluster-: 1 : 0.51 ::: User base 0.4"
## [1] "Probability of purchase from cluster-: 2 : 0.82 ::: User base 0.14"
## [1] "Probability of purchase from cluster-: 3 : 0.48 ::: User base 0.04"
From these results we can see that the Active users (cluster 2) have a purchase rate over 80% on a 14% customer base: this is our most promising group for marketing.
Apart from the launch date, the next most frequent purchases are recorded in the run-up to Christmas, on Dec 15.
From the initial exploratory data analysis, we observed 1471 unique users who logged in to the game before purchasing it. I therefore assumed they are users who try a trial version and then decide on a purchase.
Analysing these trial/beta users showed that over 75% of them purchased FIFA 18. Based on this assumption and result, we can treat them as a special group: if we market our next release to these users (e.g., send a free trial), we can expect around 75% conversion.
Although Recency, Frequency & Monetary Value are all significant predictors of a new player’s purchase probability, Frequency stood out as the most important factor: the more frequently a user logs in to the game, the more likely he/she is to purchase.
We have seen the importance of Recency, Frequency, & Monetary Value (RFM) in predicting the purchasing behavior of an active player, so I transformed the given information into RFM features and built a logistic regression model to estimate the probability of purchase.
Metrics:
• Training Accuracy: 81.3 %
Training Accuracy is the accuracy of a model on examples it was constructed on.
Method Used : 10-fold cross-validation
With 10-fold cross-validation we divided our training set into 10 sets and pass 9 sets to build the model and measure accuracy with the 10th set, repeated 10 times. In simple terms, this is used to measure how well our model understood the pattern of our training examples.
• Testing Accuracy : 79 %
Test accuracy is the accuracy of a model on examples it hasn't seen.
Method Used : Confusion Matrix
The confusion matrix shows how test examples are classified compared to the actual labels: out of 518 users in the test set, 132 were correctly classified as 0 (will not purchase) and 277 were correctly classified as 1 (will purchase).
• ROC AUC : 0.82
As noted earlier, AUC measures how well the model separates the two classes; an AUC of 0.82 indicates good discrimination.
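The 10-fold cross-validation described under Training Accuracy can be written out by hand to make the procedure concrete. This is a sketch on synthetic data with 5 folds for brevity, not the cv.glm() call used in the report:

```r
# Hand-rolled k-fold CV: fit on k-1 folds, score accuracy on the held-out fold
set.seed(1)
n <- 100
x <- runif(n)
y <- rbinom(n, 1, plogis(-1 + 3 * x))
d <- data.frame(x, y)
folds <- sample(rep(1:5, length.out = n))       # random fold assignment
acc <- sapply(1:5, function(k) {
  fit  <- glm(y ~ x, family = "binomial", data = d[folds != k, ])
  prob <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean((prob >= 0.5) == d$y[folds == k])        # held-out accuracy for fold k
})
mean(acc)                                        # average cross-validated accuracy
```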
Machine learning performance is relative: what counts as a good score can only be interpreted in context. We cannot expect 100% accuracy, but it is useful to know the best achievable performance for the chosen measure. We reached close to 80% accuracy with a limited number of observations; this may improve as the volume of data increases.
1) The 1471 users assumed to be trial users were removed from the model; building a dedicated predictive model for these users could capture the purchase behavior of trial players.
2) More data, better prediction: in analytics, more data almost always leads to more stable predictions. Of our 5000 observations, 1471 were removed as assumed trial players, and another 1460 users in player_sample have no past behavior in player_logins, player_purchases, or player_spend, leaving 2069 observations in the final modelling set. I therefore suggest re-running with more data.
1) Our primary focus should be players who actively respond and purchase. I recommend the Active players (cluster 2), a 14% customer base with a purchase probability over 0.8: remind them with advertisements.
2) Players who used a trial for previous versions are interesting: at a 41% customer base and close to 80% purchase rate, these players can be lucrative. Send them free trial versions.
For the December campaign I would change my recommendation: the two groups above are the most reliable customers, of whom roughly 80% would already have purchased the product by December.
Instead, I recommend targeting the Valuable customers (cluster 1), a 40% customer base. They are not very frequent users, but given the holiday season, this campaign might intrigue them into purchasing.